Optimizing Matrix-Matrix Multiplication for an Embedded VLIW Processor

Author

Roland E. Wunderlich

Abstract

The optimization of matrix-matrix multiplication (MMM) performance has been well studied on conventional general-purpose processors such as the Intel Pentium 4. Fast algorithms, such as those in the Goto and ATLAS BLAS libraries, exploit common microarchitectural features, including superscalar execution and the cache and TLB hierarchy, to achieve near-peak performance. However, embedded processors typically use explicitly parallel in-order execution and have configurable memory hierarchies. Thus, approaches that find good MMM code for processors like the Pentium may not be as effective for embedded processors. For this project, I investigated the methods needed to achieve high-performance MMM on an embedded VLIW (very long instruction word) processor, the Texas Instruments C6713 floating-point DSP. This processor has three distinguishing features that affect an MMM implementation: an 8-wide in-order pipeline; an L2 mapped RAM, i.e., a software-controlled scratch pad; and a direct memory access (DMA) engine. I present MMM implementations, obtained through search and a model-driven approach, that leverage the DSP microarchitecture. By using the scratch pad and DMA, I observed a 51% performance increase over a blocked MMM implementation.
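The abstract does not show the implementation, so the following is only a minimal sketch in C of the general idea of blocked MMM with tiles staged in a software-managed buffer. Everything in it is an assumption made for illustration: the sizes `N` and `NB`, the names `load_tile`, `store_tile`, `mmm_blocked`, and `mmm_naive`, and the use of plain static arrays and `memcpy` as stand-ins for the C6713 scratch pad and DMA engine. It is not the author's code and uses no TI-specific API.

```c
#include <stddef.h>
#include <string.h>

#define N  8   /* matrix dimension (hypothetical; must be a multiple of NB) */
#define NB 4   /* tile edge length (hypothetical) */

/* Stand-ins for scratch-pad buffers. On real hardware these would be
   placed in software-controlled on-chip RAM; here they are ordinary
   static arrays. */
static float tile_a[NB][NB], tile_b[NB][NB], tile_c[NB][NB];

/* Stand-in for a DMA transfer: copy one NB x NB tile out of a row-major
   N x N matrix. A real DMA engine would run asynchronously. */
static void load_tile(float dst[NB][NB], const float *src,
                      size_t row, size_t col)
{
    for (size_t i = 0; i < NB; i++)
        memcpy(dst[i], &src[(row + i) * N + col], NB * sizeof(float));
}

/* Copy one NB x NB tile back into a row-major N x N matrix. */
static void store_tile(float *dst, float src[NB][NB],
                       size_t row, size_t col)
{
    for (size_t i = 0; i < NB; i++)
        memcpy(&dst[(row + i) * N + col], src[i], NB * sizeof(float));
}

/* Blocked C = A * B for row-major N x N matrices. All arithmetic
   touches only the three staged tiles. */
void mmm_blocked(const float *a, const float *b, float *c)
{
    for (size_t i0 = 0; i0 < N; i0 += NB) {
        for (size_t j0 = 0; j0 < N; j0 += NB) {
            memset(tile_c, 0, sizeof tile_c);
            for (size_t k0 = 0; k0 < N; k0 += NB) {
                load_tile(tile_a, a, i0, k0);
                load_tile(tile_b, b, k0, j0);
                for (size_t i = 0; i < NB; i++)
                    for (size_t k = 0; k < NB; k++) {
                        float aik = tile_a[i][k];
                        for (size_t j = 0; j < NB; j++)
                            tile_c[i][j] += aik * tile_b[k][j];
                    }
            }
            store_tile(c, tile_c, i0, j0);
        }
    }
}

/* Unblocked triple loop, used only as a correctness reference. */
void mmm_naive(const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < N; k++)
                sum += a[i * N + k] * b[k * N + j];
            c[i * N + j] = sum;
        }
}
```

Calling `mmm_blocked(a, b, c)` on two row-major `N * N` float arrays fills `c` with their product. In this sketch the copies are synchronous, so it shows only the data movement pattern; overlapping the next tile transfer with the current tile's computation is what a real DMA engine would make possible, and that part is not modeled here.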


Related resources

A New Parallel Matrix Multiplication Method Adapted on Fibonacci Hypercube Structure

The objective of this study was to develop a new optimal parallel algorithm for matrix multiplication that could run on a Fibonacci Hypercube structure. Most of the popular algorithms for parallel matrix multiplication cannot run on a Fibonacci Hypercube structure; therefore, a method that can run on all structures, especially the Fibonacci Hypercube structure, is necessary for parallel matr...


Designing Hardware/Software Systems for Embedded High-Performance Computing

In this work, we propose an architecture and methodology to design hardware/software systems for high-performance embedded computing on FPGA. The hardware side is based on a many-core architecture whose design is generated automatically given a set of architectural parameters. Both the architecture and the methodology were evaluated running dense matrix multiplication and sparse matrix-vector mu...


Co-design of Compiler and Hardware Techniques to Reduce Program Code Size on a VLIW Processor

Code size is a primary concern in the embedded computing community. Minimizing physical memory requirements reduces total system cost and improves performance and power efficiency. VLIW processors rely on the compiler to statically encode the ILP in the program before its execution, and because of this, code size is larger relative to other processors. In this paper we describe the co-design of...


Matrix-Matrix Multiplications and Fault Tolerance on Hypercube Multiprocessors

Several new algorithms for matrix-matrix multiplications on hypercube multiprocessors are presented and evaluated based on the number of multiplications, additions, and transfers. The matrices to be multiplied are uniformly distributed to all processors of a hypercube system. Each processor owns some submatrices which are derived by dividing the source matrices. Each submatrix multiplication ca...


Performance of an embedded optical vector matrix multiplication processor architecture

An embedded architecture for an optical vector matrix multiplier (OVMM) is presented. The embedded architecture is aimed at optimising the data flow of the vector matrix multiplier (VMM) to improve its performance. Data dependence is discussed for the case when the OVMM is connected to a cluster system. A simulator is built to analyse the performance of the architecture. According to the simulation, Amdah...




Publication date: 2005
